library(ggplot2)
library(GGally)
library(plotly)
Xh <- c(1, 0, 0, 0)
XtXinv <- rbind(
  c( 0.087, -0.014, -0.035, -0.004),
  c(-0.014,  0.115, -0.012, -0.052),
  c(-0.035, -0.012,  0.057, -0.014),
  c(-0.004, -0.052, -0.014,  0.050)
)
XtXinv
## [,1] [,2] [,3] [,4]
## [1,] 0.087 -0.014 -0.035 -0.004
## [2,] -0.014 0.115 -0.012 -0.052
## [3,] -0.035 -0.012 0.057 -0.014
## [4,] -0.004 -0.052 -0.014 0.050
SE.predh <- sqrt(1.040 * (1 + t(Xh) %*% XtXinv %*% Xh))
SE.predh
## [,1]
## [1,] 1.06324
qt(1 - (0.05 / 2), 30 - 4)
## [1] 2.055529
c(0.9918 - (qt(1 - (0.05 / 2), 30 - 4) * SE.predh), 0.9918 + (qt(1 - (0.05 / 2), 30 - 4) * SE.predh))
## [1] -1.193722 3.177322
property <- read.table("property.txt")
colnames(property) <-
c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property
Regarding variable types: Age, Vacancy Rate, and Total Square Footage are doubles, while Operating Expenses and Rental Rate are integers. Now we observe the distribution of each variable along with some summary statistics.
ggplot(data = property, aes(x = Ren.Rate)) + geom_histogram(bins = 10, fill = "#55A3C0")
summary(property$Ren.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.50 14.00 15.00 15.14 16.50 19.25
We can see approximate symmetry with light tails in the Rental Rate variable. The mean and the median are very close to each other.
ggplot(data = property, aes(x = Age)) + geom_histogram(bins = 10, fill = "#556EC0")
summary(property$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 7.864 15.000 20.000
For the Age variable, we can see a bimodal distribution: the majority of the data sits between 0 and 5 and between 10 and 20. Because of this split, the median (4) falls where the larger bulk of the data is, while the mean (7.864) is pulled toward the center of the range by the concentration of larger values.
ggplot(data = property, aes(x = Exp)) + geom_histogram(bins = 13, fill = "#7255C0")
summary(property$Exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 8.130 10.360 9.688 11.620 14.620
With our Operating Expenses variable, we observe a left-skewed distribution. Most of the data is within the IQR, from about 8.13 to 11.62.
ggplot(data = property, aes(x = Vac.Rate)) + geom_histogram(bins = 9, fill = "#7255C0")
summary(property$Vac.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.03000 0.08099 0.09000 0.73000
In the Vacancy Rate data, we can see a very strong right skew. Most of our data does not exceed 0.1. Because of the skewing, we can see that the median is smaller than the mean.
ggplot(data = property, aes(x = Sq.Foot)) + geom_histogram(bins = 7, fill = "#7255C0")
summary(property$Sq.Foot)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27000 70000 129614 160633 236000 484290
With our Square Footage variable, we again see a right skew. It is not as strong as in Vacancy Rate: the distribution has more mass toward the center, as well as a heavier right tail.
ggplotly(ggpairs(property, title = "Correlogram of Property Data"))
From the plot, we can see that the pair of variables with the highest correlation is Rental Rate and Square Footage, while the pair with the smallest correlation is Rental Rate and Vacancy Rate. This makes sense: the scatter plot for Rental Rate and Square Footage has the most linear resemblance, while Rental Rate and Vacancy Rate show almost no relationship. We can also see that Vacancy Rate is negatively correlated with Age and Operating Expenses.
model1 <- lm(Ren.Rate ~ Age + Exp + Vac.Rate + Sq.Foot, data = property)
summary(model1)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Vac.Rate + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1872 -0.5911 -0.0910 0.5579 2.9441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.220e+01 5.780e-01 21.110 < 2e-16 ***
## Age -1.420e-01 2.134e-02 -6.655 3.89e-09 ***
## Exp 2.820e-01 6.317e-02 4.464 2.75e-05 ***
## Vac.Rate 6.193e-01 1.087e+00 0.570 0.57
## Sq.Foot 7.924e-06 1.385e-06 5.722 1.98e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.137 on 76 degrees of freedom
## Multiple R-squared: 0.5847, Adjusted R-squared: 0.5629
## F-statistic: 26.76 on 4 and 76 DF, p-value: 7.272e-14
The least squares estimates are the following:
model1$coefficients
## (Intercept) Age Exp Vac.Rate Sq.Foot
## 1.220059e+01 -1.420336e-01 2.820165e-01 6.193435e-01 7.924302e-06
The fitted regression function: \[ \hat{Y}_i = 12.201 - 0.142 X_{i1} + 0.282X_{i2} + 0.619X_{i3} + (7.924\times 10^{-6})X_{i4} \]
The \(R^2\) for the model is \(0.5847\) and the \(R^2_{adj}\) is \(0.5629\).
We can print the ANOVA table to gather more information.
#MSE
anova(model1)
The ANOVA table shows us that our \(MSE = 1.293\).
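The residual mean square can also be extracted programmatically rather than read off the printed table. A minimal sketch, using the built-in mtcars data as a stand-in model (property.txt is not bundled here):

```r
# MSE (residual mean square) three equivalent ways, on a stand-in model
fit <- lm(mpg ~ wt + hp, data = mtcars)
mse_anova  <- tail(anova(fit)[["Mean Sq"]], 1)          # last row of the ANOVA table
mse_direct <- sum(residuals(fit)^2) / df.residual(fit)  # SSE / df(SSE)
mse_sigma  <- sigma(fit)^2                              # squared residual std. error
c(mse_anova, mse_direct, mse_sigma)
```

All three agree, so for model1 any of them recovers the 1.293 reported above.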
#Residuals vs Fitted Values
ggplot() + aes(x = model1$fitted.values, y = model1$residuals) + geom_point() + labs(x = "Fitted Values", y = "Residuals", title = "Residuals vs. Fitted Values") + geom_abline(intercept = 0, slope = 0)
# Normal Q-Q Plot
ggplot(model1, aes(sample = model1$residuals)) + stat_qq() + stat_qq_line() + labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "Normal Q-Q")
# Residual Boxplot
ggplot() + aes(x = model1$residuals) + geom_boxplot() + labs(x = "Residuals")
We can see from the Residuals vs. Fitted Values plot that we have random scatter with no sign of a pattern, suggesting that a linear model is appropriate for this data. Our Normal Q-Q plot shows that the residual distribution is approximately normal with signs of light tails. Our box plot shows that the residuals are approximately symmetric, with only 4 outliers in the data set.
#Residuals vs Age
ggplot() + aes(x = property$Age, y = model1$residuals) + geom_point() + labs(x = "Age", y = "Residuals", title = "Residuals vs. Age") + geom_abline(intercept = 0, slope = 0)
# Residuals vs Operating Expenses
ggplot() + aes(x = property$Exp, y = model1$residuals) + geom_point() + labs(x = "Operating Expenses", y = "Residuals", title = "Residuals vs. Operating Expenses") + geom_abline(intercept = 0, slope = 0)
# Residuals vs Vacancy Rate
ggplot() + aes(x = property$Vac.Rate, y = model1$residuals) + geom_point() + labs(x = "Vacancy Rate", y = "Residuals", title = "Residuals vs. Vacancy Rate") + geom_abline(intercept = 0, slope = 0)
# Residuals vs Square Footage
ggplot() + aes(x = property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Square Footage", y = "Residuals", title = "Residuals vs. Square Footage") + geom_abline(intercept = 0, slope = 0)
### Interactions
#Residuals vs Age*Expenses
ggplot() + aes(x = property$Age * property$Exp, y = model1$residuals) + geom_point() + labs(x = "Age*Expenses", y = "Residuals", title = "Residuals vs. Age*Expenses") + geom_abline(intercept = 0, slope = 0)
#Residuals vs Age*Vac.Rate
ggplot() + aes(x = property$Age * property$Vac.Rate, y = model1$residuals) + geom_point() + labs(x = "Age*Vac.Rate", y = "Residuals", title = "Residuals vs. Age*Vac.Rate") + geom_abline(intercept = 0, slope = 0)
#Residuals vs Age*Sq.Foot
ggplot() + aes(x = property$Age * property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Age*Sq.Foot", y = "Residuals", title = "Residuals vs. Age*Sq.Foot") + geom_abline(intercept = 0, slope = 0)
#Residuals vs Expenses*VacRate
ggplot() + aes(x = property$Exp * property$Vac.Rate, y = model1$residuals) + geom_point() + labs(x = "Exp*Vac.Rate", y = "Residuals", title = "Residuals vs. Exp*Vac.Rate") + geom_abline(intercept = 0, slope = 0)
#Residuals vs Expenses*Sq.Foot
ggplot() + aes(x = property$Exp * property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Exp*Sq.Foot", y = "Residuals", title = "Residuals vs. Exp*Sq.Foot") + geom_abline(intercept = 0, slope = 0)
#Residuals vs Vac.Rate*Sq.Foot
ggplot() + aes(x = property$Vac.Rate * property$Sq.Foot, y = model1$residuals) + geom_point() + labs(x = "Vac.Rate*Sq.Foot", y = "Residuals", title = "Residuals vs. Vac.Rate*Sq.Foot") + geom_abline(intercept = 0, slope = 0)
We have six interaction terms. In the plots of residuals against interactions involving Vacancy Rate, we notice that many of the observations are very close to zero. Because of this, the data does not show random scatter, but rather a high concentration on the left with outliers to the right. We see the same pattern, though weaker, in the interactions involving the Total Square Footage variable.
In our non-interaction plots, we can see that the Age and Vacancy Rate variables have clusters of data, which violates our desire for random scatter.
To test our beta coefficients, we have: \[ H_0: \beta_k = 0 \\ H_a: \beta_k \neq 0 \] With test statistic and null distribution: \[ T^* = \frac{\hat{\beta}_k}{s(\hat{\beta}_k)} \\ T^* \sim t_{n-p} \]
We reject \(H_0\) if \(|T^*| > t(1 - \alpha/2,\ n-p)\).
Thankfully, the summary of our model carries out this test for us via the p-values. For the intercept, Age, Operating Expenses, and Square Footage, with respective p-values (< 2e-16, 3.89e-09, 2.75e-05, 1.98e-07), we reject \(H_0\), concluding that those coefficients are significantly different from 0. However, for Vacancy Rate, with p-value 0.57, we fail to reject \(H_0\), leading us to conclude that Vacancy Rate does not help us predict Rental Rate. All other variables, we conclude, are useful in modeling Rental Rate.
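The \(T^*\) values printed by summary() are just estimate over standard error. A quick sketch confirming this, on a stand-in model fit to the built-in mtcars data (property.txt is not available here):

```r
# Rebuild the t statistics and p-values by hand and compare to summary()
fit   <- lm(mpg ~ wt + hp, data = mtcars)
ct    <- summary(fit)$coefficients
tstar <- ct[, "Estimate"] / ct[, "Std. Error"]            # T* = beta-hat / s(beta-hat)
pvals <- 2 * pt(abs(tstar), df.residual(fit), lower.tail = FALSE)
all.equal(unname(tstar), unname(ct[, "t value"]))
```

The manual statistics and p-values match the summary() columns exactly.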
anova(model1)
\[ SSE = 98.231 \\ df(SSE) = 76 \\ SSR = 14.819 + 72.802 + 8.381 + 42.325 = 138.327\\ df(SSR) = 4\\ SSTO = 98.231 + 138.327 = 236.558\\ df(SSTO) = 80 \]
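The arithmetic above can be verified directly: the sequential sums of squares from the ANOVA table must add to SSR, and SSE plus SSR must give SSTO.

```r
# Check the sum-of-squares decomposition reported above
SSE  <- 98.231
SSR  <- 14.819 + 72.802 + 8.381 + 42.325  # sequential SS from anova(model1)
SSTO <- SSE + SSR
c(SSR = SSR, SSTO = SSTO)                 # 138.327 and 236.558
```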
We test the hypothesis for the regression relation:
\[ H_0: \beta_1 = \dots = \beta_{p-1} = 0 \\ H_a: \text{Not all } \beta_k \text{ equal zero} \]
\[ F^* = \frac{MSR}{MSE} \\ F^* \sim_{H_0} F_{(p-1, n-p)} \]
We reject \(H_0\) if \(F^* > F(1 - \alpha, p-1, n-p)\).
Fstar <- (138.327 / 4) / (98.231 / 76)
Fstar
## [1] 26.75543
pf(Fstar, 4, 76, lower.tail = FALSE)
## [1] 7.272536e-14
Since our p-value is much smaller than our \(\alpha = 0.01\), we reject the null hypothesis and conclude that not all of the \(\beta_k\) equal zero.
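Equivalently, the decision rule can be applied on the critical-value scale: compare \(F^*\) to \(F(0.99,\ 4,\ 76)\).

```r
# Critical-value version of the overall F test at alpha = 0.01
Fstar <- 26.75543
Fcrit <- qf(1 - 0.01, df1 = 4, df2 = 76)  # F(0.99, 4, 76)
c(Fcrit = Fcrit, reject = Fstar > Fcrit)
```

The critical value is roughly 3.6, far below \(F^* = 26.76\), so the same rejection follows.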
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
model2$coefficients
## (Intercept) Age Exp Sq.Foot
## 1.237058e+01 -1.441646e-01 2.671670e-01 8.178210e-06
The fitted regression function for model2: \[ \hat{Y}_i = 12.371 - 0.144 X_{i1} + 0.267X_{i2} + (8.178\times 10^{-6})X_{i4} \]
We decided to fit a new model without \(X_3\) since we concluded in part f that \(X_3\) was not significant for modeling rental rates.
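Refitting after dropping a predictor can also be done with update(), which rewrites the formula for us. A sketch on the built-in mtcars data (a stand-in; qsec plays the role of the dropped predictor):

```r
# Drop one term from a fitted model and refit on the same data
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- update(full, . ~ . - qsec)
names(coef(reduced))   # qsec no longer appears
```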
summary(model2)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0620 -0.6437 -0.1013 0.5672 2.9583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.237e+01 4.928e-01 25.100 < 2e-16 ***
## Age -1.442e-01 2.092e-02 -6.891 1.33e-09 ***
## Exp 2.672e-01 5.729e-02 4.663 1.29e-05 ***
## Sq.Foot 8.178e-06 1.305e-06 6.265 1.97e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared: 0.583, Adjusted R-squared: 0.5667
## F-statistic: 35.88 on 3 and 77 DF, p-value: 1.295e-14
anova(model2)
The value of \(R^2 = 0.583\) and the \(R^2_{adj} = 0.5667\). The \(MSE = 1.281\).
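As a check, \(R^2_{adj}\) follows from \(R^2\) via \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p}\), with \(n = 81\) and \(p = 4\) for model2:

```r
# Adjusted R^2 from R^2 for model2
R2 <- 0.583; n <- 81; p <- 4
adjR2 <- 1 - (1 - R2) * (n - 1) / (n - p)
adjR2   # agrees with the reported 0.5667 up to rounding of R^2
```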
sumModel1 <- summary(model1)
sumModel2 <- summary(model2)
# Print the std errors
sumModel1$coefficients[,2]
## (Intercept) Age Exp Vac.Rate Sq.Foot
## 5.779562e-01 2.134261e-02 6.317235e-02 1.086813e+00 1.384775e-06
sumModel2$coefficients[,2]
## (Intercept) Age Exp Sq.Foot
## 4.928469e-01 2.092012e-02 5.729487e-02 1.305377e-06
We can see that the standard error is smaller in model 2 for every coefficient. Had we constructed intervals for model 1 with the larger standard errors, those intervals would have been wider than the intervals for model 2.
tval <- qt(1 - (0.01/2), nrow(property) - 4)
#confidence intervals
t(
rbind(
sumModel2$coefficients[, 1] - tval * sumModel2$coefficients[, 2],
sumModel2$coefficients[, 1] + tval * sumModel2$coefficients[, 2]
)
)
## [,1] [,2]
## (Intercept) 1.106888e+01 1.367229e+01
## Age -1.994188e-01 -8.891046e-02
## Exp 1.158399e-01 4.184941e-01
## Sq.Foot 4.730452e-06 1.162597e-05
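The same intervals come out of confint() directly. A sketch on a stand-in model (built-in mtcars data) confirming that the manual construction above matches confint():

```r
# Manual 99% coefficient intervals vs. confint()
fit    <- lm(mpg ~ wt + hp, data = mtcars)
ct     <- summary(fit)$coefficients
tval   <- qt(1 - 0.01 / 2, df.residual(fit))
manual <- cbind(ct[, 1] - tval * ct[, 2], ct[, 1] + tval * ct[, 2])
all.equal(unname(manual), unname(confint(fit, level = 0.99)))
```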
# Prediction for model 1
predict(model1, data.frame(Age = 4, Exp = 10, Vac.Rate = 0.1, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 14.51518 11.42732 17.60305
# Prediction for model 2
predict(model2, data.frame(Age = 4, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 14.46625 11.40128 17.53121
# Range for interval 1
17.60305 - 11.42732
## [1] 6.17573
# Range for interval 2
17.53121 - 11.40128
## [1] 6.12993
We can see that the interval for model2 is slightly narrower than the interval for model1.
Overall, we would prefer model 2: it has a lower MSE, statistically significant predictors for every variable in the model, and a higher \(R^2_{adj}\) value. A model with fewer variables is also easier to interpret.